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ABSTRACT 

Motivation: Prior biological knowledge greatly facilitates the mean- 
ingful interpretation of gene-expression data. Causal networks 
constructed from individual relationships curated from the literature 
are particularly suited for this task, since they create mechanistic 
hypotheses that explain the expression changes observed in datasets. 
Results: We present and discuss a suite of algorithms and tools for 
inferring and scoring regulator networks upstream of gene-expression 
data based on a large-scale causal network derived from the Ingenuity 
Knowledge Base. We extend the method to predict downstream 
effects on biological functions and diseases and demonstrate the 
validity of our approach by applying it to example datasets. 
Availability: The causal analytics tools 'Upstream Regulator Analysis', 
'Mechanistic Networks', 'Causal Network Analysis' and 'Downstream 
Effects Analysis' are implemented and available within Ingenuity 
Pathway Analysis (I PA, http://www.ingenuity.com). 
Supplementary information: Supplementary material is available at 
Bioinformatics online. 
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1 INTRODUCTION 

The interpretation of high-throughput gene-expression data 
is greatly facilitated by the consideration of prior biological 
knowledge. Traditionally this has been done using statistical 
gene- set-enrichment methods where differentially expressed 
genes are intersected with sets of genes that are associated with 
a particular biological function or pathway (Abatangelo et aL, 
2009). One more recent approach involves the application of 
causal networks that integrate previously observed cause-effect 
relationships reported in the literature (Chindelevitch et aL, 
2012a; Felciano et aL, 2013; Kumar et aL 2010; Martin et aL, 
2012; Pollard et aL, 2005). While still depending on statistics, this 
is more powerful than gene-set enrichment since it leverages 
knowledge about the direction of effects rather than mere asso- 
ciations. In this article, we describe causal analysis approaches 
that have been implemented in Ingenuity Pathway Analysis 
(IPA) with particular focus on the details of the underlying 
algorithms, and the application to a number of real-world use 
cases. IPA is a commercial software package and is described in 
the supplementary material. 

Given a gene-expression dataset, our main goals are to eluci- 
date the upstream biological causes and probable downstream 
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effects on cellular and organismal biology. It is critical to infer 
the identity of upstream regulatory molecules and associated 
mechanisms to provide biological insight to the observed expres- 
sion changes. We also aim to predict whether such regulators 
are activated or inhibited based on the distinct up- and down- 
regulation pattern of the expressed genes, and determine which 
causal relationships previously reported in the literature are 
likely relevant for the biological mechanisms underlying the 
data. Upstream regulators are not limited to transcription fac- 
tors; they can be any gene or small molecule that has been 
observed experimentally to affect gene expression in some 
direct or indirect way. A similar approach, relying on the same 
methodology is also used to predict downstream functional ef- 
fects and pheno types. Apart from generating likely mechanistic 
hypotheses, causal inference can also be used to find potential 
upstream regulators with a response opposite to the observed 
expression pattern, which is useful for the prediction of thera- 
peutic compound effects. This application is similar to the 
approach taken by the Connectivity Map tool (Lamb et aL, 
2006), except that here we rely on the wide range of liter ature- 
curated biological findings about compounds and their inter- 
actions instead of a gene-expression database derived from 
in vitro tested compounds. 

The causal network underlying our algorithms is based on the 
Ingenuity Knowledge Base, a large structured collection of 
observations in various experimental contexts with nearly 5 mil- 
lion findings manually curated from the biomedical literature or 
integrated from third-party databases. The network contains 
~40 000 nodes that represent mammalian genes and their prod- 
ucts, chemical compounds, microRNA molecules and biological 
functions. Nodes are connected by ~1 480000 edges representing 
experimentally observed cause-effect relationships that relate 
to expression, transcription, activation, molecular modification 
and transport as well as binding events. Network edges are also 
associated with a direction of the causal effect, i.e. either activat- 
ing or inhibiting. 

We describe four causal analytics algorithms that are available 
in IPA: (i) Upstream Regulator Analysis (URA) determines likely 
upstream regulators that are connected to dataset genes through a 
set of direct or indirect relationships; (ii) Mechanistic Networks 
(MN) builds on URA by connecting regulators that are likely 
part of the same signaling or causal mechanism in hypothesis 
networks; (iii) Causal Network Analysis (CNA) is a generaliza- 
tion of URA that connects upstream regulators to dataset mol- 
ecules but takes advantage of paths that involve more than one 
link (i.e. through intermediate regulators), and can be used to 
generate a more complete picture of possible root causes for the 
observed expression changes; and (iv) Downstream Effects 
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Analysis (DEA) applies the methodology of URA to the inference 
of and impact on biological functions and diseases that are down- 
stream of the genes whose expression has been altered in a 
dataset. Using several examples we show how these tools are 
applied to gene-expression data in practice. 

2 APPROACH 

The inference of upstream regulators needs to be based on stat- 
istics since it cannot be guaranteed that all relationships present 
in the causal network are relevant and actually occur in the given 
experimental context. Also, genes are often modulated by several 
upstream regulators (sometimes with opposing effects), and it is 
not known a priori which will dominate in a particular system. 
Filtering literature findings by specific contexts (e.g. by a par- 
ticular tissue or cell line) generally does not work well because it 
leads to networks that are too sparse for meaningful inference. 
We therefore construct many possible upstream regulators and 
networks serving as hypotheses for the actual biological mech- 
anism underlying the data, and then score those regulators by 
their statistical significance. In particular, we use two scores that 
address two independent aspects of the inference problem: an 
'enrichment' score [Fisher's exact test (FET) P- value] that meas- 
ures overlap of observed and predicted regulated gene sets, and a 
Z-score assessing the match of observed and predicted up/down 
regulation patterns. We find that a Z-score is particularly suited 
for this kind of problem since it serves as both a significance 
measure and a predictor for the activation state of the regulator. 

A similar approach has been reported in Pollard et al. (2005) 
where 'richness' and 'concordance' P-values are used to score 
regulators of expression changes derived from type 2 diabetes 
patients. More recently Chindelevitch et al. (2012a,b) present a 
rigorous discussion of statistical significance in a causal network 
with signed interactions based on a 'ternary dot product distri- 
bution'. This is achieved by transforming the network such 
that edge signs are projected onto the nodes, and exact 
P-values are computed separately for both possible activation 
states of a given upstream regulator. In contrast, the Z-score 
used in the present approach, represents an (asymptotic 
Gaussian) approximation, but is easier to compute and combines 
both cases into one score. 

3 METHODS 

The various causal analysis methods used in IPA are described below. 
3.1 Causal network 

Causal analysis algorithms are based on a 'master' network which is 
derived from the Ingenuity Knowledge Base, and given by a directed 
multigraph G = (V,E) with nodes v e V representing mammalian genes, 
chemicals, protein families, complexes, micro RNA species and biological 
processes, and edges e e E reflecting observed cause-effect relationships. 
For the following let V g c Fbe the set of all genes, and V p c Kthe set of 
all biological processes. For each edge e e E, we define functions a(e) and 
z(e) that map e to its unique source and target nodes, respectively. The 
graph G has no self-edges, i.e. V<? e E : r(e) / o(e). Each edge in e e E is 
associated with a set of underlying findings F(e) obtained from the litera- 
ture, where each finding / e F(e) is associated with a 'sign' 
s{f) e {— 1,0, 1} that represents the regulation direction of the causal 
effect. If s(f) = 1(— 1) effect is activating (inhibiting), and for s(f) = 0, 



the direction of the effect is unknown or ambiguous. Depending on the 
underlying findings, edges are classified into the distinct types, 'T, 'A' and 
'P', represented by three disjoint subsets of E: E u E a and E p . T-edges are 
related to transcription and expression events including protein-DNA 
binding (i.e. regulation of the abundance of the target node), while 
A-edges represent the functional activation or inhibition of the target 
node (e.g. through phosphorylation in a signaling cascade). P-edges are 
associated with the regulation of biological processes (e.g. apoptosis). The 
master network G is a multigraph since two given source and target nodes 
can be connected by a T-edge, and an A-edge at the same time. 

The various finding categories and their respective association with 
edge types and signs are given in a table in the Supplementary 
Material. Findings about changes of molecular modification states (e.g. 
phosphorylation) are included in the A-edge type if an activating or 
inhibiting effect can be inferred. All T-edges are connected to genes 
as their target nodes, e e E t x(e) e V g , and all P-edges connect to 
biological processes, e e E p =$> x{e) e V p . Depending on the signs of the 
underlying findings, each edge e e E is in turn associated with a unique 
direction of the causal effect that is either activating, inhibiting or 
unknown, and represented by the sign s(e) e {—1,0, 1}. In addition, we 
also associate edges with weights w(e) e [0, 1) reflecting our confidence in 
the assigned direction of the effect. Details are given in the Supplementary 
Material. 

3.2 Gene-expression data 

All differentially expressed genes in a given dataset that are also present 
as nodes in the master network form a subset D c V g . The methods 
described here do not take individual expression levels into account but 
instead assume that transcriptionally altered genes have been determined 
using a suitable cut off applied to the measured expression change. 
Each gene in the dataset, d e D, can be either up- or down-regulated 
which is represented by the sign s D (d) e {-1,1}. 

3.3 Upstream Regulator Analysis 

The goal of URA is to identify molecules upstream of the genes in the 
dataset that potentially explain the observed expression changes. Since 
it is a priori unknown which causal edges in the master network are 
applicable to the experimental context, we use a statistical approach to 
determine and score those regulators whose network connections 
to dataset genes as well as associated regulation directions are unlikely 
to occur in a random model. In particular, we define an overlap P-value 
measuring enrichment of network-regulated genes in the dataset, as 
well as an activation Z-score which can be used to find likely regulating 
molecules based on a statistically significant pattern match of up- and 
down-regulation, and also to predict the activation state (either activated 
or inhibited) of a putative regulator. 

Here, we consider transcription and expression (T) edges only by look- 
ing at the subgraph G' = (V, E t ), and defining the subset of genes that are 
regulated by at least one edge in G', 

V rg :={ve V g \3e eE t :v = z(e)}. 

A potential regulator r can be any node in V that is either a gene, protein 
family, complex, micro RNA, or chemical. For a particular given regula- 
tor r e V, we define the set of downstream regulated genes as 

R(r) := {v e V rg \3e e E t : r = o(e) Av = z(e)}. 

For each v e R(r), the sign of v is defined as regulation direction of v 
under the assumption that r is activated, which is given by the regulation 
direction of the connecting edge, as 

s R (r, v) := s(e) where r = cr(e) Av = r(e) and e e E t . 

Similarly we define the weight associated with v to be 

wr(t, v) := w(e) where r = a(e) A v = r(e) and e e E t . 
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3.3.1 Overlap P-value For a particular regulator r the overlap 
P- value p(r) measures enrichment of r-regulated genes in the dataset D 
without taking into account the regulation direction, i.e. independent of 
the edge weight or sign (Fig. 1A). Pro tein-DN A- binding edges with sign 
equal to zero are included. The calculation is based on the one-sided FET 
which assumes a random dataset with a constant number of genes as the 
null model, and where the P-value is given by 

min(c, d) 



P(r)= £ 



A-0 

where n = \ V rg 

0(r)= \R(r)nb\, and a= \0(r)\,b 
1 0(r) I and d—n — a — b-c. 



(a + b)\(c + d)\(a + c)\(b + d)\ 
'-i (a + k)\(b - k)\(c - k)\(d + k)\n\ 



\DnV rg \-\0(r 



3.3.2. Activation Z-score The activation Z-score makes predictions 
about potential regulators by using information about the direction of 
gene regulation. Its purpose is two-fold: first, it can be used to infer the 
activation state of a putative regulator (i.e. whether the regulator is 
activated or inhibited). This is achieved by assessing consistency of the 
pattern match between the up/down gene-regulation pattern and the 
activation/inhibition pattern given by the network relative to a random 
pattern. Second, similar to the overlap P-value, the Z-score can be used to 
determine likely regulators based on statistical significance of the pattern 
match. However, the latter use requires careful assessment whether the 
underlying null model is appropriate, which is discussed in more detail 
below. For the purpose of the activation Z-score we only consider edges 
where the regulation direction is well-defined, i.e. s e {—1,1} (Fig. IB). 
For a particular regulator r the 'overlap' between r-regulated genes and 
the dataset is then given by 

0(f) : = {ve R(r)\s R (r, v) ^ 0 A v e D] 

and we define the activation Z-score as 



z(r): 



E w R( r > v ) s R( r > v)s D (v) 

veO 

^ 1/2 • 

E[wn(r,v)] :% 

ved 



As shown below z(r) is approximately normally distributed for random 
signs s R (r, v) or sd(v), so we can interpret z in the following way: if \z\ is 
high enough (say |z| >2) we consider the match between the signs s R (r, v) 
and sd(v) as being significant. The sign of z then determines whether the 
regulator r is predicted to be activated or inhibited. For z<0 the signs 
s R (r, v) and s R (r, v) are mostly opposite, consistent with the regulator 
being in an inhibited state, while for z>0 both signs are mostly the 
same, consistent with an activated regulator. 

In the following, we take a detailed look at the underlying null 
model: Let N = \0\ and the index i= I, ...,N runs over all nodes 
Vi e O. We consider the product s R (r, Vi)s D (vi) as being represented by 





oo 



Fig. 1. Overlap P-value (A) and activation Z-score (B) calculation (see 
text). In (B) the pointed arrowheads represent activating relationships, 
and the blunt arrowheads represent inhibitory relationships 



independent and identically distributed random variables x t e {—1,1} 
where both values —1 and 1 have equal probabilities 1/2. Expectation 
value and variance of x t are then given by E[x,] = 0 and V[x l ] = 1 . Let 
Wf = w R (r, Vi) be the weights of the edges connecting r to v z . Setting now 
y — E« w t x i wrtn = 0, and V[y] = a 2 = E, w 2 , and noting that 
z = y/a it is seen that approximately z ~ A/"(0, 1) for large enough N. 

The validity of z(r) to test for statistical significance of the regulator 
depends on whether the null model described above is deemed appropri- 
ate. Consider the extreme case where all dataset genes are up-regulated 
and the regulator r has only activating downstream connections. 
Assuming for simplicity that all weights are equal to 1, we then have 
z(r) = y/N which is >2 for N>4. However, this would not be perceived 
as a significant match since any regulator with only activating down- 
stream edges (and N>4) would come up as significant for this dataset. 
A better null model for calling significant regulators is based on rando- 
mizing the data rather than randomly flipping regulation directions. 
One possibility is to permute labels of genes in D while keeping their 
regulation direction and the size of the overlap TV constant. This can be 
achieved approximately by skewing the distribution of the random 
variables x t defined above, i.e. by setting their expectation value \i to a 
non-zero value \jl — hd^r where \jl d (or \jl r ) is given by the expectation 
value of the sign when randomly (and independently) picking a dataset 
gene (or an edge downstream of r). In particular we have \±v = ^E^ 
and /jl r = ^E 5 ^' where the sums run over all N D dataset genes, and 
over all N R r-regulated genes. For the approximation to be valid, we also 
assume that \fi\ is not too close to 1, and for simplicity we keep the 
weights Wj fixed. 

The activation Z-score z(r) reflects a reasonable null model for calling 
significant regulators if the expectation value \i is sufficiently close to 
zero. In practice, we flag all regulators where \p\ >0.25 to indicate that 
the calculated Z-score should not be used for significance calls. In the 
case when the regulation directions of the dataset and downstream causal 
edges are skewed, it is possible to define a 'bias-corrected' Z-score 
(described in the Supplementary Material and available as an option in 
IP A) that can be used to determine statistically significant regulators. 



3.4 Mechanistic Networks: inferring likely causal 
mechanisms 

Regulators determined with URA are not necessarily independent of each 
other. It is for instance possible that the causal effect of a regulator r\ on 
the dataset is relayed through another regulator r 2 , with both regulators 
r\ and r 2 coming up as independent significant hits. One indication for 
this is that r\ and r 2 are themselves connected by a causal edge r\ -»> r 2 
in G. The goal of the MN algorithm is to determine those network edges 
between pre-determined upstream regulators for which there is statistical 
evidence that the corresponding relationship is likely relevant for the 
causal mechanism behind the dataset. The most significant causal edges 
between regulators are then used to construct networks downstream of 
a 'master' regulator in order to indicate possible causal (e.g. signaling) 
mechanisms. 

The algorithm is based on the following idea: if the causal effect of r\ 
on some data set molecule d e D is transmitted through the intermediate 
regulator r 2 , we expect an elevated occurrence of cases where all three 
edges, r\ ->> r 2 ,r 2 —> d and r\ — > d are present in the network, and the 
edge r\ -> d is explained by the path r\ -> r 2 -> d. We therefore look for 
statistical enrichment of these 'causal transitive triangles' (Fig. 2A). This 
enrichment is given by the intersection of the overlaps of the regulated 
gene sets 0(r\) D 0(r 2 ) in the dataset (Fig. 2B and C), i.e. we compute the 
FET P-value with D D R rg serving as the universe. These FET P- values 
are calculated for every edge r\ r 2 in G for which the regulators r x and 
r 2 meet pre-defined cut-offs with respect to their overlap P-value p(r) and 
activation Z-score z(r). For every upstream regulator r, the MN algo- 
rithm then constructs downstream networks with predefined 'breadth' TV 



525 



A.Kramer et al. 




data set data set 



Fig. 2. Enrichment of 'causal transitive triangles' (A) indicates causal 
dependency of upstream regulators A and B [compare (B) versus (C); 
see text] 

and 'depth' K from significant causal edges that connect r to dataset genes 
through several links by using the following recursive algorithm. 

(1) Starting from any upstream regulator r, select TV regulators s t that 
are connected downstream through causal edges with lowest edge 
P- values. 

(2) For each s{. if maximal path length K is reached or a cycle is 
detected, skip, else set r := Sj and go to (1). 

(3) Build network from union of all traversed edges. 

Through the inference of likely relevant causal relationships the MN al- 
gorithm makes a prediction about which previously identified regulators 
are closer to the dataset genes than others in a causal chain of events. 
However, it is possible that the constructed networks are incomplete be- 
cause not enough statistical evidence could be collected for missing edges. 
The algorithm does not enforce consistency of the predicted activation 
states and also accepts protein-protein binding edges as causal connec- 
tions between upstream regulators. 

3.5 Causal Network Analysis: exhaustive enumeration of 
possible root causes 

The CNA algorithm generalizes URA by including paths from regulators 
to regulated molecules that involve more than one edge. For a given 'root' 
regulator r this is done by constructing shortest paths up to a certain 
length from r to dataset molecules, and constructing networks from the 
union of those paths in order to build mechanistic hypotheses. 
Hypothesis networks are then statistically scored using overlap P-value 
and activation Z-scores, and the activation state of the root regulator 
is determined. In contrast to URA, for simplicity, we are not taking 
continuous edge weights into account, but instead set all edge weights 
to 1 if they pass a predefined cut off 8 (set to 8 = 0.25 in the implemen- 
tation). We only consider edges with non-ambiguous directions of regu- 
lation, i.e. non-zero sign s(e). The algorithm operates on the multigraph 
G = (V,E a U E t ), where the included sets of A- and T-edges are given 
by E a / t = {e e E a / t \w(e) > 8}, and the set of regulated genes (through 
T-edges) is V rg = {v e V g \3e e E t : v = r(e)}. 

For each node r e V, we construct all shortest paths P from r to every 
gene v e V rg in G. These paths with length K are represented by sets of 
edges P = {e\, . . . , e^, where e\, ... eK-\ e E a , eK e E t and cr(e\) = r, 
z(eK) = v, r(ei-i) = cr(ei). For any given path P, we define the sign 

s P (P) = Y\s(e) 

eeP 

which represents the composite direction of regulation for a signal passed 
through the path as a causal sequence of events. 

For each pair of nodes re V, v e V rg {r =fi v) with at least one shortest 
path P connecting r and v, we define a 'virtual' edge r -> v representing 
all shortest paths connecting r and v if all those paths are consistent, 
i.e. have the same net effect s P (Fig. 3 A and B). We then collect 
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Fig. 3. Replacing multi-step paths from root regulators to target genes 
(A) by 'virtual' edges with the same net effect (B). The pointed arrow- 
heads represent activating (+1) relationships, and the blunt arrowheads 
represent inhibitory relationships (-1). The dashed lines indicate virtual 
relationships composed of the net effect of the paths between the root 
regulator and the target genes 

all virtual edges corresponding to paths with length K in a set E K . 
Note that E\ = E t and all sets E K are disjoint. The sign s(e) e {—1,1} 
of e e E K is then defined by s(e) '.— s P {P) where P is any shortest path 
associated with e. We proceed by constructing graphs G K composed of 
virtual edges, representing paths with maximal length K: 

g k = (v,\JeX 

Note, that the G K are nested subgraphs of each other, i.e. G K c G L 
if K<L,, with the special case G\ = (V,E t ). 

For every potential regulator r e V the algorithm computes overlap 
P-values Pk{^) and activation Z-scores z K (r) by applying URA (with all 
edge weights set to 1) independently to all networks G K . Hypothesis 
networks are then constructed from the union of all paths P connecting 
r to the dataset molecules. For any given regulator r there may be multiple 
hypotheses corresponding to the different values of K. These hypotheses 
again represent nested subgraphs with their sets of regulated genes given 
by R K (r) = {veV rg \3eeE K :v = x{e) A r = or(e)}, and R K (r) C R L (r) if 
K<L. In practice we only construct hypotheses with K < 3. 

In order to limit the number of networks returned by the algorithm 
we only include hypotheses that add substantial information to the 
'sub'-hypotheses that are contained in the same network, i.e. in the 
spirit of Occam's razor preference is given to simpler networks. In par- 
ticular, we require that (i) the overlap P-value of a hypothesis is more 
significant than the P- value of a contained hypothesis, PK( r )<PK-i( r ), 
and (ii) the set of regulated molecules is different from the set of molecules 
targeted by any individual sub-network. 

We note that all constructed hypothesis networks are completely 
consistent with respect to their direction of edge effects, and may contain 
loops only if a regulator node appears as a dataset molecule in D at the 
same time. 

3.6 Network-bias-corrected P- value 

Overlap P-values p(r) calculated using the method described in Section 
3.3 may be skewed, i.e. may appear too significant because of the presence 
of network hub genes in the dataset. Hub genes are connected to many 
upstream regulators so their occurrence in a hypothesis network is less 
surprising. To correct for this network bias, we also calculate a network- 
bias-corrected P-value that measures significance of the overlap between 
the dataset and regulated genes by comparing to overlaps of random 
datasets with distributions of in-degrees similar to the actual dataset, 
and therefore preserving network topology. To perform this statistic, 
we divide the sets of regulated genes V rg or V rg into bins containing 
only genes whose in-degree lies within a certain range. We sample 
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independently from each bin such that the number of genes drawn cor- 
responds to the number of dataset genes that fall in that bin. The (right- 
sided) P- value is then calculated from the distribution of the overlap with 
the hypothesis-regulated genes as before. If there is only one bin contain- 
ing all genes, the resulting P-value is the regular FET overlap P-value 
defined in Section 3.3. If the number of bins is >1, this P- value needs 
to be calculated numerically by explicit permutation sampling. In our 
implementation, this is done using at most 10000 independent permuta- 
tions of genes (fewer if the estimated P-value>0.01). The minimum 
P- value that can be computed this way is therefore 10 -4 . The computa- 
tion of the network-bias-corrected P-value uses bins based on a logarith- 
mic scale with base 2, so the ranges of in-degrees are {1}, {2,3}, {4, . . . ,7}, 
{8,..., 15}, {16,..., 31}, etc. 

3.7 Downstream Effects Analysis 

The goal of DEA is to identify those biological processes and functions 
that are likely to be causally affected by up- and down-regulated genes. In 
addition, it is also predicted whether those processes are increased or 
decreased. The approach is very similar to that of URA, except that 
the direction of edges connecting the dataset genes with the predicted 
entity (here, the biological process or disease) is reversed. The calculation 
of the overlap P-value is essentially the same as in standard enrichment 
functional analysis (an existing IPA feature). 

For the calculation of the activation Z-score, we consider the graph 
G = (V g U V p ,E p ) as the underlying network, and define edge signs and 
weights as sr(v,p) := s(e) and wr(v,p) := s(e) where v = o{e) A p = z{e) 
and e e E p . For any given process p, the set of genes regulating that 
process is 

R(p) = {ve V g \3e e E p : v = cr(e) Ap = r(e) A s R (v,p) ± 0} 
and the overlap with the dataset is given by 
0(p) = R(p)nD. 

The activation Z-score z(p) is then given by the corresponding formula in 
Section 3.3 and its sign is used to predict whether the downstream process 
is increased or decreased. 



4 IMPLEMENTATION 

IPA enables end-users to execute URA, MN, DEA and CNA 
prediction algorithms when analyzing their datasets. In this web 
application, the user selects a gene-expression dataset to analyze, 
and specifies several optional settings that will tailor the analysis 
as desired. After the analysis job is submitted, IPA performs a 
number of different calculations on the dataset, including the 
prediction algorithms, and produces an analysis result. This 
result can be viewed within IPA and it displays conclusions in 
a variety of ways, depending on the algorithm. 

URA is always executed as part of IPA's dataset analysis, 
and there are no options to choose before running the analysis. 
The results are displayed in a table in which each row shows 



information about a particular regulator and the molecules in 
the dataset that the regulator targets (see example in Fig. 4). 
Columns for predicted state, Z-score, and P-value enable users 
to identify regulators of interest. The table can be sorted and 
filtered, and toolbar operations support the creation of network 
diagrams and lists based on items in the table. 

The MN algorithm is run upon the regulators from the URA 
results, and generated networks are accessible from the 
'Mechanistic Network' column to the far right in the upstream 
regulator table. The column displays the number of dataset mol- 
ecules targeted by the network followed in parenthesis by the 
number of regulators in the network. Clicking on the link in 
this column will display the corresponding network (see example 
in Fig. 5). 

Unlike URA, the MN algorithm can be influenced by a var- 
iety of parameters. When the analysis is initially run, default 
values are used for these settings. After the initial run of the 
analysis, the user can re-run the algorithm with different values 
for P-value and Z-score cut offs, included relationship types, and 
parameters that control the shapes of the resulting networks. 
When the algorithm re-executes, the new mechanistic networks 
will replace the old ones in the result table. 

Like the other prediction features, the CNA executes as part 
of a dataset analysis in IPA. Because of the large number of 
hypothesis networks that are typically returned, the CNA tool 
provides a means to automatically annotate each hypothesis for 
its connection to a particular biological concept such as a disease, 
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Fig. 5. Mechanistic network for beta-estradiol [example (1) in Section 
5.1]. In this network, beta-estradiol is postulated to activate ESR1 (the 
estrogen receptor), NCOA3 (a key estrogen receptor co-regulator) and to 
affect a number of other regulators to explain the gene-expression 
changes in the dataset. The set of regulators in total connect to 320 
dataset genes (not shown), with beta-estradiol connecting directly to 
183 of them 
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Fig. 4. URA result table for example (1) in Section 5.1 (beta-estradiol-treated MCF-7 cells) 
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Fig. 6. DEA results for example (2) in Section 5.2 (TNF -stimulated HUVEC cells). The visualization is a TreeMap (hierarchical heatmap) where the 
major boxes represent a family (or category) of related functions. Each individual colored rectangle is a particular biological function or disease and the 
color orange indicates its predicted state: increasing (orange), or decreasing (blue). Darker colors indicate higher absolute Z-scores. In this default view, 
the size of the rectangles is correlated with increasing overlap significance (using FET P- value). The image has been cropped for better readability 



phenotype, biological process, compound or gene. If the user 
enters such a concept into the tool, then after the hypotheses 
are created, IPA evaluates whether there are known relationships 
in the Ingenuity Knowledge Base between the root regulator of 
each hypothesis and that concept. 

The results of DEA are displayed in a tree map (Shneiderman 
et al, 1992) (see Fig. 6 as an example) which clusters related 
functions together, thus providing a high-level view of the 
function families. The organization makes it easier to see cases 
where related functions are predicted to increase/decrease most 
significantly as a group. The user can then drill down to specific 
functions to see the predictions at a more granular level. 

5 BIOLOGICAL RESULTS FOR EXAMPLE USE 
CASES 

5.1 Upstream Regulator Analysis and Mechanistic 
Networks 

One means to validate the effectiveness of the URA is to test its 
predictive capabilities using datasets derived from cells treated 
with a defined upstream stimulus. To this end, we retrieved from 
GEO (http://www.ncbi.nlm.nih.gov/geo) relevant gene-expres- 
sion datasets which had not been curated by Ingenuity: 

(1) MCF-7 human breast cancer cells exposed to beta-estra- 
diol, a well-known agonist of the alpha and beta estrogen 
receptors (the transcription factors ESR1 and ESR2 in 
humans). Retrieved from GSE11352 (Lin et al, 2007). 

(2) Primary human endothelial cells (HUVEC) stimulated 
with the cytokine TNF. Retrieved from GSE2639 
(Viemann et al, 2006). 

The raw microarray data files were processed through an auto- 
mated feature extraction and normalization pipeline developed at 



Ingenuity Systems (and based on R/Bioconductor), and 
significantly differentially expressed genes were uploaded and ana- 
lyzed in IPA. The three top-most predicted activated upstream 
regulators by Z-score for the dataset involving beta-estradiol trea- 
ted cells are beta-estradiol itself, the more general parent com- 
pound estrogen, and the estrogen receptor (ESR1). A 
mechanistic network generated for beta-estradiol in this analysis 
(see Fig. 5) shows that it may exert its effects on the observed gene 
expression by interacting with ESR 1 (as expected) and also through 
co-activators and other regulatory molecules. The set of 1 1 regula- 
tors plus beta-estradiol in total connects to 320 dataset genes. 

The three top-most predicted inhibited upstream regulators 
(by negative Z-score, data not shown) are tretinoin, fulvestrant 
and 3-deazaneplanocin. Compounds like these that are predicted 
to be 'inhibited' produce expression patterns in the dataset 
partially or completely opposite to the effect observed in the 
literature. This can imply therapeutic utility for these com- 
pounds, by potentially reversing a phenotype observed in the 
disease state. Indeed, fulvestrant is a known selective estrogen 
receptor down-regulator approved for treatment of hormone re- 
ceptor positive metastatic breast cancer. Tretinoin is the active 
metabolite of vitamin A (also known as all-trans retinoic acid) 
and is an RXR agonist that has been shown to inhibit the pro- 
liferation of MCF-7 cells and is indicated for certain cancers 
(Arteaga et al, 2013; Wang et al, 2000; Zhu et al, 1997). It 
may affect proliferation in MCF-7 cells by interfering with the 
estrogen receptor's ability to activate its downstream target genes 
(Ombra et al, 2013) or through other means (Muller et al, 2002). 
3-deazaneplanocin A (also known as DZNep) affects histone 
methylation and recent work suggests it might have therapeutic 
value for breast cancer (Hay den et al, 2011). 

In the second example of the HUVEC cells treated with the 
cytokine TNF, the top-predicted activated regulators are 
TNF itself, followed by lipopolysaccharide, NF-/cB complex, 
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IFNG and poly rI:rC-RNA. Several of these regulators initi- 
ate an inflammatory response much like TNF does (i.e. lipopoly- 
saccharide and poly rI:rC-RNA), and NF-/cB is the key 
downstream transcription factor activated when TNF activates 
its receptor. To further validate the algorithm, we retrieved nine 
additional gene-expression datasets from GEO all derived from 
mammalian cells directly exposed to TNF. The 10 datasets in 
total come from human, mouse, rat and cow, and include a 
variety of cell types such as fibroblasts, epithelial cells and mono- 
cytes, and are from several technology platforms including 
Affymetrix, Agilent and Illumina (RNA-Seq). In eight of 10 of 
these datasets, IPA predicted TNF as the top-activated regulator, 
and ranked it second and third in the other two datasets (see 
Supplementary Tables). This high level of consistency lends sup- 
port to the premise that the URA has predictive power across 
mammalian cell types and species, and is independent of expres- 
sion measurement technology. 

5.2 Downstream Effects Analysis 

To highlight the utility of DEA we examined its predictions for 
the HUVEC cells treated with TNF. As shown in Figure 6, a large 
number or biological processes are predicted to be increased by 
TNF, especially those in the 'Hematological System Development 
and Function', 'Cellular Movement', 'Immune cell Trafficking', 
'Cell-to-Cell Signaling and Interaction' and 'Inflammatory 
Response' categories. Examples of specific increased function 
among those categories are 'leukocyte migration' (Z-score 
5.006), 'inflammatory response' (Z-score 4.578) and 'quantity of 
T lymphocytes' (Z-score 3.934). Increases in these types of biolo- 
gical functions are expected for TNF. 

5.3 Causal Network Analysis 

The power of CNA is its ability to detect novel master upstream 
regulators that operate though other regulators, especially in 
cases where few or no relationships exist directly between it 
and the dataset genes. To highlight this capability we obtained 
a gene-expression dataset derived from a mouse model of anky- 
losing spondylitis (Haynes et ai, 2012). This disease is character- 
ized by initial inflammation that progresses to osteoproliferation 
leading to inappropriate bone formation and eventually joint 



fusion. The pathology can be produced in mouse by injecting 
human cartilage proteoglycan extract several times over a 
12-week period by which time inflammation leads to bone 
overgrowth in the spine. 

The authors sought to find evidence in the expression data that 
the Wnt signaling system was perturbed. By using RT-PCR and 
immunohistochemistry they were able to show that the Wnt 
pathway inhibitory molecules DKK1 and SOST were slightly 
reduced in concentration both at the mRNA and protein levels 
in the spinal tissue. Using CNA in IPA with the authors' micro- 
array-based mRNA expression data as input, we found that 
SOST was predicted to be significantly inhibited with a Z-score 
of —1.96, with a network depth of 3, meaning that some paths 
from SOST to the dataset molecules involve three distinct 'hops' 
(with two intervening regulators), such as the path SOST 
Smad — ► EGFR — >► LCN2 as shown in Figure 7. The gene for 
SOST itself was not detectably differentially expressed in this 
experiment based on the microarray data, highlighting the sen- 
sitivity of the causal analysis. 

As a point of interest, the canonical pathway in IPA which 
shares the highest overlap with the regulators in this causal net- 
work is the pathway called 'Role of Osteoblasts, Osteoclasts 
and Chondrocytes in Rheumatoid Arthritis', (containing 
SOST, the Smad family, SMAD1, BMP6, and RUNX2), 
indicating the relevance of SOST and its downstream targets 
both in bone homeostasis and inflammatory disease. 

6 DISCUSSION 

This article describes algorithms, tools and visualizations recently 
added to IPA that enable scientists to combine the directional 
information encoded in their gene-expression datasets with 
knowledge extracted from the literature to infer the underlying 
causes of their observed transcriptional changes and to predict 
likely outcomes. This is a significant advance compared to tools 
that just look for statistical enrichment in overlap to sets of 
genes. An aspect of upstream analysis that makes it especially 
powerful is that it is essentially an activity-based prediction, in 
that the measurement of the activity of a transcriptional regula- 
tor is via the measurement of genes known to be differentially 
expressed by it in a defined direction. The data being analyzed 




Fig. 7. CNA result for SOST (see Section 5.3). SOST is the master regulator (or 'root' regulator) of a small causal network containing five intermediate 
regulators that may explain the up and down regulation of the 26 dataset molecules shown in the bottom row (red indicates up-regulation and green 
down-regulation). The regulators are colored by their predicted activation state: activated (orange) or inhibited (blue). Darker colors indicate higher 
absolute Z- scores. The edges connecting the nodes are colored orange when leading to activation of the downstream node, blue when leading to its 
inhibition, and yellow if the findings underlying the relationship are inconsistent with the state of the downstream node. Pointed arrowheads indicate that 
the downstream node is expected to be activated if the upstream node connected to it is activated, while blunt arrowheads indicate that the downstream 
node is expected to be inhibited if the upstream node that connects to it is activated 
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has measured that differential expression. This can be contrasted 
with for example pathway overlap prediction, where in general 
there is no guarantee that the members of the pathway are 
differentially expressed upon pathway activation or inhibition. 

As described above, the algorithms operate over a large-scale 
causal graph assembled from individual literature- supported re- 
lationships between molecules, diseases and biological processes 
in the Ingenuity Knowledge Base. These relationships are derived 
from a myriad of experimental systems in mouse, rat and human. 
Though possibly counter-intuitive to mix relationships derived 
from a variety of contexts to analyze data from a particular 
context, the results are generally valid due to the conserved ac- 
tivity of most genes and proteins across tissues and cell types. 
That being said, it may be possible in the future to automatically 
infer a context based on the expression data and to filter accord- 
ingly to only relevant relationships for a particular analysis. 

There are at least 30 papers that have already made use of 
these new causal tools in IPA. One of the first to use URA 
demonstrated how transcription factor activation differs between 
mouse, macaque, and swine in response to infection by the 2009 
pandemic H1N1 influenza virus (Go et al., 2012). Another used 
both URA and DEA to help characterize the mechanisms that 
provide breast cancer protection during pregnancy (Meier-Abt 
et a/., 20 13). There are certain to be additional applications of 
these powerful analytics as scientists discover and adopt them 
in their research. 
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